Skip to content

perf(core): Eliminate stat() syscall and lazy-load encoding libraries in fileRead#1399

Closed
yamadashy wants to merge 1 commit intomainfrom
perf/eliminate-stat-syscall
Closed

perf(core): Eliminate stat() syscall and lazy-load encoding libraries in fileRead#1399
yamadashy wants to merge 1 commit intomainfrom
perf/eliminate-stat-syscall

Conversation

@yamadashy
Copy link
Copy Markdown
Owner

@yamadashy yamadashy commented Apr 5, 2026

Two optimizations to reduce file reading overhead:

1. Remove redundant fs.stat() before fs.readFile()

  • Previously each file required stat() (size check) then readFile() = 2 syscalls
  • Now readFile() runs first, then buffer.length is checked = 1 syscall
  • Files exceeding maxFileSize (default 10MB) are rare; the occasional oversized read is acceptable for halving syscall count on all files

2. Lazy-load jschardet and iconv-lite

  • These libraries have ~25ms combined import cost at startup
  • The fast UTF-8 path (covers ~99% of source code files) never needs them
  • They are only loaded on first encounter of a non-UTF-8 file

Extracted from #1377. Pure performance optimization with no functional changes.

Checklist

  • Run npm run test
  • Run npm run lint

Open with Devin

… in fileRead

Two optimizations to reduce file reading overhead:

1. Remove redundant fs.stat() before fs.readFile()
   - Previously each file required stat() (size check) then readFile() = 2 syscalls
   - Now readFile() runs first, then buffer.length is checked = 1 syscall
   - Files exceeding maxFileSize (default 10MB) are rare; the occasional
     oversized read is acceptable for halving syscall count on all files

2. Lazy-load jschardet and iconv-lite
   - These libraries have ~25ms combined import cost at startup
   - The fast UTF-8 path (covers ~99% of source code files) never needs them
   - They are only loaded on first encounter of a non-UTF-8 file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 5, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: feff0237-cb02-46e8-a5d9-bc651de02a8d

📥 Commits

Reviewing files that changed from the base of the PR and between 611b88a and f682fe9.

📒 Files selected for processing (1)
  • src/core/file/fileRead.ts

📝 Walkthrough

Walkthrough

The PR optimizes file reading in readRawFile by eliminating a redundant fs.stat() syscall (now uses buffer length for size validation), adds early logging, and implements lazy loading of encoding detection dependencies via a cached helper that dynamically imports jschardet and iconv-lite only on the non-UTF-8 slow path.

Changes

Cohort / File(s) Summary
File Read Optimization
src/core/file/fileRead.ts
Consolidates fs.stat() + conditional logic into single fs.readFile() call; replaces stats.size check with buffer.length for max file size validation; implements lazy loading of jschardet and iconv-lite via cached getEncodingDeps() helper using dynamic import(); adds early "Reading file" log statement.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • PR #1007: Modifies encoding detection and decoding logic in fileRead.ts using jschardet and iconv with conditional skip behavior.
  • PR #1303: Eliminates fs.stat() call in favor of single fs.readFile() with buffer.length for size checks in file read path.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/eliminate-stat-syscall

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 5, 2026

⚡ Performance Benchmark

Latest commit:f682fe9 perf(core): Eliminate stat() syscall and lazy-load encoding libraries in fileRead
Status:✅ Benchmark complete!
Ubuntu:1.59s (±0.03s) → 1.57s (±0.03s) · -0.02s (-1.0%)
macOS:1.46s (±0.23s) → 1.44s (±0.15s) · -0.01s (-1.0%)
Windows:2.03s (±0.02s) → 1.97s (±0.03s) · -0.06s (-2.9%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run

@yamadashy yamadashy closed this Apr 5, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.41%. Comparing base (611b88a) to head (f682fe9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1399      +/-   ##
==========================================
+ Coverage   87.40%   87.41%   +0.01%     
==========================================
  Files         116      116              
  Lines        4389     4393       +4     
  Branches     1018     1019       +1     
==========================================
+ Hits         3836     3840       +4     
  Misses        553      553              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes file reading by lazy-loading encoding detection libraries and removing the fs.stat syscall to reduce I/O operations. Feedback indicates a potential race condition and type safety issues in the lazy-loading logic. Additionally, there is a concern that removing the early size check via fs.stat could lead to high memory consumption or out-of-memory errors when processing very large files.

Comment thread src/core/file/fileRead.ts
Comment on lines +9 to +16
let _jschardet: typeof import('jschardet') | undefined;
let _iconv: typeof import('iconv-lite') | undefined;
const getEncodingDeps = async () => {
if (!_jschardet || !_iconv) {
[_jschardet, _iconv] = await Promise.all([import('jschardet'), import('iconv-lite')]);
}
return { jschardet: _jschardet, iconv: _iconv };
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of getEncodingDeps has a potential race condition. If multiple concurrent calls to readRawFile trigger the slow path (non-UTF-8 files) before the first import() completes, the modules will be imported multiple times. Additionally, the return type of the function includes undefined for both dependencies, which will cause TypeScript errors or runtime crashes when calling encodingDeps.jschardet.detect() at line 73.

Using a single promise to manage the lazy loading ensures that the modules are only loaded once and provides a clean, non-nullable return type for the caller.

let encodingDepsPromise: Promise<{
  jschardet: typeof import('jschardet');
  iconv: typeof import('iconv-lite');
}> | undefined;

const getEncodingDeps = async () => {
  if (!encodingDepsPromise) {
    encodingDepsPromise = Promise.all([
      import('jschardet'),
      import('iconv-lite'),
    ]).then(([jschardet, iconv]) => ({
      jschardet: (jschardet as any).default || jschardet,
      iconv: (iconv as any).default || iconv,
    }));
  }
  return encodingDepsPromise;
};

Comment thread src/core/file/fileRead.ts
// Read the file directly and check size afterward, avoiding a separate stat() syscall.
// This halves the number of I/O operations per file.
// Files exceeding maxFileSize are rare, so the occasional oversized read is acceptable.
const buffer = await fs.readFile(filePath);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Removing the fs.stat check before fs.readFile introduces a risk of high memory consumption. While it reduces the number of syscalls, any file exceeding maxFileSize (e.g., a large log file, dataset, or binary not caught by extension checks) will now be fully loaded into memory before the size check occurs. In extreme cases, this could lead to OutOfMemoryError or process crashes if the file is very large (e.g., several GBs).

Consider if the performance gain of one less stat syscall (which is metadata-only and very fast) outweighs the safety of the early size check, especially for a tool that might be run on diverse and potentially large repositories.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 5, 2026

Code Review — perf(core): Eliminate stat() syscall and lazy-load encoding libraries

Clean, well-scoped PR with good inline documentation. Two observations worth considering:


1. Memory impact of reading oversized files before rejection

Priority: Recommended

Removing stat() means files are now fully read into memory before the size check. Combined with:

  • maxFileSize default of 50 MB (not 10 MB as the PR description states)
  • FILE_COLLECT_CONCURRENCY = 50 parallel workers

The theoretical worst case is 50 × 50 MB = 2.5 GB of heap allocation before any file is discarded — e.g., a repo with many large .sql or .csv files that pass the binary-extension check.

The old stat()-first approach rejected these with near-zero memory cost. For typical repos this is fine, but for the MCP server use case (potentially untrusted repos), it's worth acknowledging.

Possible mitigations (not necessarily needed now)
  • Add a comment noting the concurrency × maxFileSize memory ceiling
  • Consider fs.open() + fstat() on the fd (single open, no TOCTOU race, no full read) if this becomes an issue
  • Or cap concurrent reads when file sizes are unknown

On the positive side: removing the separate stat() call eliminates the TOCTOU race that existed before (file could be swapped between stat and read). That's a security improvement.


2. Lazy-load caching pattern — consider a singleton promise

Priority: Recommended

The current pattern with two separate let variables has a benign race under concurrency: if multiple files fail UTF-8 decoding simultaneously, all concurrent callers see _jschardet === undefined before any await resoletes, triggering redundant Promise.all calls.

A cleaner idiomatic pattern that also fixes the TypeScript type narrowing:

let _encodingDepsPromise: Promise<{ jschardet: typeof import('jschardet'); iconv: typeof import('iconv-lite') }> | undefined;
const getEncodingDeps = () => {
  if (!_encodingDepsPromise) {
    _encodingDepsPromise = Promise.all([import('jschardet'), import('iconv-lite')])
      .then(([jschardet, iconv]) => ({ jschardet, iconv }));
  }
  return _encodingDepsPromise;
};

This is the standard singleton-promise pattern: one cached Promise, no split-state risk, correct under concurrency, and the resolved type is non-optional.


Minor notes

  • PR description says "default 10MB" — the actual default is 50 MB (configSchema.ts:84). Worth correcting for accuracy.
  • Test coverage is adequate for the behavioral changes. One gap worth considering: there's no test for a valid non-UTF-8 file decoded successfully (e.g., Latin-1/Shift-JIS). The existing slow-path test only covers the rejection branch, not the success branch — so a getEncodingDeps wiring regression wouldn't be caught.
  • Leading underscore naming (_jschardet, _iconv) — Airbnb style discourages dangling underscores. Minor nit.

Overall this is a solid performance optimization. The lazy-load of encoding libraries is clearly beneficial (~25ms startup saving, UTF-8 fast path is great). The stat removal is a reasonable tradeoff for typical usage. 👍

🤖 Generated with Claude Code

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant